-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Flow Visibility] Add flow-aggregator with clickhouse related e2e tests #3673
Conversation
Codecov Report
@@ Coverage Diff @@
## main #3673 +/- ##
==========================================
- Coverage 64.59% 57.25% -7.35%
==========================================
Files 278 392 +114
Lines 39513 55091 +15578
==========================================
+ Hits 25523 31541 +6018
- Misses 12009 21085 +9076
- Partials 1981 2465 +484
Flags with carried forward coverage won't be shown. Click here to find out more.
|
@@ -1147,3 +1376,55 @@ func matchSrcAndDstAddress(srcIP string, dstIP string, isDstService bool, isIPv6 | |||
} | |||
return srcField, dstField | |||
} | |||
|
|||
type ClickHouseFullRow struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not defined anywhere else in the Flow Aggregator code? I am just worried about keeping things consistent between the FA code and the test code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the struct is defined in clickhouseclient.go, but:
- the Json tag binding is only used for test code. Not sure if it makes sense to include those tags there just for the sake of this e2e testing unmarshalling
- (nit) attrs are not exported / public in that file
- certain db side columns are not used as part of client (TimeInserted, Trusted), thus not included under fa.
I don't think there's too much overhead in having two structs here. If anything new is added on the FA side, the test code should be updated to take those in anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The JSON tags would definitely not be an issue, however it sounds like there are other reasons for keeping them separate.
if err != nil || rc != 0 { | ||
// ClickHouseInstallation CRD from ClickHouse Operator install bundle applied soon before | ||
// applying CR. Sometimes apiserver validation fails to recognize resource of | ||
// kind: ClickHouseInstallation. Retry in such scenario. | ||
if strings.Contains(stderr, "ClickHouseInstallation") || strings.Contains(stdout, "ClickHouseInstallation") { | ||
return false, nil | ||
} | ||
return false, fmt.Errorf("error when deploying the flow visibility YML %s: %s, %s, %v", flowVisibilityYML, stdout, stderr, err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a problem for regular users as well (outside of test code)? If yes, is this documented and did we consider having 2 YAML manifests instead to reduce the likelihood of the issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's two steps with two yaml files in the user facing documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's one more argument in favor of deploying CH ou-of-band from the test cases.
test/e2e/framework.go
Outdated
ac := func(config *agentconfig.AgentConfig) { | ||
config.FeatureGates["FlowExporter"] = false | ||
// deployFlowVisibilityClickHouse deploys ClickHouse operator and DB. | ||
func (data *TestData) deployFlowVisibilityClickHouse() (*PodIPs, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My preference would have been to refactor the test code so that for these tests we deploy Antrea, the FA, and ClickHouse "out-of-band" (so not as part of the test cases themselves). I think it tends to be a better and simpler approach. However, that doesn't exactly match how we have designed tests so far, so we can think about doing it as a follow-up.
test/e2e/framework.go
Outdated
return nil | ||
} | ||
|
||
func (data *TestData) mutateFlowAggregatorConfigMap(ipfixCollector string, faClusterIP string) error { | ||
func (data *TestData) mutateFlowAggregatorConfigMap(ipfixCollector, clickHouse string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these parameter names are too vague
maybe add the URL
suffix to them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
if [ "$MODE" == "e2e" ]; then | ||
mkdir -p base/provisioning/datasources | ||
cp $KUSTOMIZATION_DIR/base/clickhouse.yml base/clickhouse.yml | ||
cp $KUSTOMIZATION_DIR/base/kustomization-e2e.yml base/kustomization.yml | ||
cp $KUSTOMIZATION_DIR/base/kustomize-config.yml base/kustomize-config.yml | ||
cp $KUSTOMIZATION_DIR/base/provisioning/datasources/create_table.sh base/provisioning/datasources/create_table.sh | ||
cp $KUSTOMIZATION_DIR/../clickhouse-operator-install-bundle.yml clickhouse-operator-install-bundle.yml | ||
|
||
$KUSTOMIZE edit add base base | ||
$KUSTOMIZE edit add base clickhouse-operator-install-bundle.yml | ||
$KUSTOMIZE edit add patch --path imagePullPolicyOperator.yml | ||
|
||
$KUSTOMIZE edit set image flow-visibility-clickhouse-monitor=projects.registry.vmware.com/antrea/flow-visibility-clickhouse-monitor:latest | ||
$KUSTOMIZE edit add patch --path imagePullPolicyClickhouse.yml --group clickhouse.altinity.com --version v1 --kind ClickHouseInstallation --name clickhouse | ||
else | ||
$KUSTOMIZE edit add base $BASE | ||
fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why we need so much customization compared to dev mode. Maybe it indicates that something should be improved?
Have you given any thought to what this will be like when the visibility solution is moved to the theia repo? Maybe you intend to improve things at that time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
essentially what we're doing here is:
- re-create a kustomize folder to use kustomization-e2e.yml as base
kustomization.yml
, which only includes clickhouse db but not other grafana related manifest components - since we don't want to dup other shared files (patches / configs / db init script) in the flow-visibility dir, we're copying them from the base dir instead of building directly in its dedicated base dir. This makes it look like having a lot of customization, but actually half of the steps are just copying existing files.
- also includes
clickhouse-operator-install-bundle.yml
in one yaml file for easier manifest copying and deploying with test code. For release yaml it's a separate file.
Once FV manifest is moved out of Antrea we'll have a simplified folder structure dedicated for e2e test. At that time all these copying is not necessary.
cdf8e03
to
5fbb762
Compare
/test-integration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Someone from the Flow Visibility team should review this as well.
hack/generate-manifest.sh
Outdated
if $FLOW_EXPORTER; then | ||
HELM_VALUES+=("featureGates.FlowExporter=true") | ||
if [ "$MODE" == "dev" ]; then | ||
HELM_VALUES+=("flowCollector.flowPollInterval=1s" "flowCollector.activeFlowExportTimeout=2s" "flowCollector.idleFlowExportTimeout=1s") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized that these values (for flowCollector.activeFlowExportTimeout
& flowCollector.idleFlowExportTimeout
) don't really make sense outside of tests, yet we are applying them every time we generate the manifest in 'dev' mode.
Maybe you could modify generate-manifest.sh so it can take as parameters extra Helm values, or the path to an Helm values file. Alternatively you could generate the manifest you need using Helm directly, with your own values file, and without using generate-manifest.sh at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a new param --extra-helm-values-file
for generate-manifest.sh
and defined these test-facing values in a separate file.
test/e2e/framework.go
Outdated
|
||
statefulSetRestartAnnotationKey = "antrea-e2e/restartedAt" | ||
|
||
clickHousePort = "9000" | ||
clickHouseOperatorYMLUrl = "https://raw.githubusercontent.com/Altinity/clickhouse-operator/0.18.2/deploy/operator/clickhouse-operator-install-bundle.yaml" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we also have a version of this manifest checked-in in this repo, maybe it makes more sense to use our version for consistency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
separated current fv deployment step into 2. first with the install bundle already checked-in in antrea, and then the clickhouse related definitions. we still need the error handling when applying ClickHouseInstallation CR though.
test/e2e/framework.go
Outdated
_, err = data.PodWaitFor(defaultTimeout, podName, kubeNamespace, func(p *corev1.Pod) (bool, error) { | ||
for _, condition := range p.Status.Conditions { | ||
if condition.Type == corev1.PodReady { | ||
return condition.Status == corev1.ConditionTrue, nil | ||
} | ||
} | ||
return false, nil | ||
}) | ||
if err != nil { | ||
return nil, err | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can use a more concise style for those:
if _, err := data.PodWaitFor(defaultTimeout, podName, kubeNamespace, func(p *corev1.Pod) (bool, error) {
...
}); err != nil {
return nil, err
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
introduced a helper function to wait for ready status as @dreamtalen 's comment below.
test/e2e/fixtures.go
Outdated
tb.Logf("Applying flow aggregator YAML with ipfix collector address: %s", ipfixCollectorAddr) | ||
if err := testData.deployFlowAggregator(ipfixCollectorAddr); err != nil { | ||
tb.Logf("Deploying ClickHouse") | ||
chPodIPs, err := testData.deployFlowVisibilityClickHouse() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized we are using the Pod IP to configure the Flow Aggregator. Do you know why we don't use the default instead (the Service DNS)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
previously the dns name may have issue when tested together with other Antrea kind e2e test modes. reverted to the default svc dns name as the case is isolated now.
@@ -323,6 +335,9 @@ func exportLogs(tb testing.TB, data *TestData, logsSubDir string, writeNodeLogs | |||
// dump the logs for flow-aggregator Pods to disk. | |||
data.forAllMatchingPodsInNamespace("", flowAggregatorNamespace, writePodLogs) | |||
|
|||
// dump the logs for flow-visibility Pods to disk. | |||
data.forAllMatchingPodsInNamespace("", flowVisibilityNamespace, writePodLogs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we get the logs from the CH operator Pod as well, just in case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
config.FeatureGates["FlowExporter"] = false | ||
// deployFlowVisibilityClickHouse deploys ClickHouse operator and DB. | ||
func (data *TestData) deployFlowVisibilityClickHouse() (*PodIPs, error) { | ||
err := data.CreateNamespace(flowVisibilityNamespace, nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the Flow Visibility NS creation is part of flowVisibilityYML, did you delete that intentionally in the e2e mode of generating manifest script?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ns is defined under grafana yml which is not included as base here. I think maybe we can keep it as-is and revist this after theia migration.
9911dc0
to
a76de68
Compare
a7fefbd
to
ee93057
Compare
/test-conformance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some small comments, otherwise LGTM
} | ||
if err := wait.Poll(defaultInterval, defaultTimeout, func() (bool, error) { | ||
rc, stdout, stderr, err := testData.RunCommandOnNode(controlPlaneNodeName(), | ||
fmt.Sprintf("curl -Ss %s:%s", chSvc.Spec.ClusterIP, clickHouseHTTPPort)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really like assuming that curl
is available on the Nodes, but the only other solution is to start a Pod just to curl...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from kind node image dockerfile https://github.com/kubernetes-sigs/kind/blob/main/images/base/Dockerfile
we can see curl has been there for more than 3 years and actively being used as part of image build process. given its small size and common use case, i think the possibility of them suddenly remove it from the node image is quiet low. it's also used the same way in part of current antctl e2e test. given that fa tests are isolated now I think maybe we can always come back and and switch to the pod approach later.
test/e2e/framework.go
Outdated
rc, stdout, stderr, err := testData.RunCommandOnNode(controlPlaneNodeName(), | ||
fmt.Sprintf("curl -Ss %s:%s", chSvc.Spec.ClusterIP, clickHouseHTTPPort)) | ||
if rc != 0 || err != nil { | ||
log.Infof("Error curl clickhouse service: %s", strings.Trim(stderr, "\n")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log.Infof("Error curl clickhouse service: %s", strings.Trim(stderr, "\n")) | |
log.Infof("Failed to curl clickhouse Service: %s", strings.Trim(stderr, "\n")) |
do we use log.Info
in other places?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done. it's used in several other places under the same file. for example:
func killProcessAndCollectCovFiles
: log.Infof("Sending SIGINT to '%s'", processName)
func DeleteNamespace
: log.Infof("Deleting Namespace %s took %v", namespace, time.Since(startTime))
@@ -168,8 +168,8 @@ FLOW_VIS_YML="/tmp/flow-visibility.yml" | |||
# If a flow collector address is also provided, we update the Antrea | |||
# manifest (to enable all features) | |||
if [[ $FLOW_COLLECTOR != "" ]]; then | |||
echo "Generating manifest with all features enabled along with FlowExporter feature" | |||
$THIS_DIR/../../../../hack/generate-manifest.sh --mode dev --all-features > "${ANTREA_YML}" | |||
echo "Generating manifest with flow exporter enabled" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FlowExporter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Signed-off-by: Shawn Wang <[email protected]>
This change introduces the following changes to fa e2e test: - only run fa e2e test if asserted - setup antrea agent with proper config as part of manifest generation - update github workflow for kind tests Signed-off-by: Shawn Wang <[email protected]>
- deploy ch operator in a separate step from in-repo yml bundle - define antrea side exporter configs from helm values file - extract ch operator logs as well in case of failure - update ch related image with flow-visibility prefix - test ping ch svc for success connectivity before apply fa Signed-off-by: Shawn Wang <[email protected]>
/test-conformance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This change ports various improvments for the current flow visibility e2e test in antrea main repo: antrea-io/antrea#3673 Other minor changes: - removed coverage related code and option from the test - changed busybox image to the correct one Signed-off-by: Shawn Wang <[email protected]>
This change ports various improvments for the current flow visibility e2e test in antrea main repo: antrea-io/antrea#3673 Other minor changes: - removed coverage related code and option from the test - changed busybox image to the correct one Signed-off-by: Shawn Wang <[email protected]>
This change ports various improvments for the current flow visibility e2e test in antrea main repo: antrea-io/antrea#3673 Other minor changes: - removed coverage related code and option from the test - changed busybox image to the correct one Signed-off-by: Shawn Wang <[email protected]>
This change ports various improvments for the current flow visibility e2e test in antrea main repo: antrea-io/antrea#3673 Other minor changes: - removed coverage related code and option from the test - changed busybox image to the correct one Signed-off-by: Shawn Wang <[email protected]>
This change ports various improvments for the current flow visibility e2e test in antrea main repo: antrea-io/antrea#3673 Other minor changes: - removed coverage related code and option from the test - changed busybox image to the correct one Signed-off-by: Shawn Wang <[email protected]>
This change ports various improvments for the current flow visibility e2e test in antrea main repo: antrea-io/antrea#3673 Other minor changes: - removed coverage related code and option from the test - changed busybox image to the correct one Signed-off-by: Shawn Wang <[email protected]>
This PR adds the following changes to flow visibility e2e:
--flow-visibility
flag is passed in